by RUI ZHANG, December 05, 2017
This is a data exploration of campaign finance contributions from New York State for the 2016 US presidential election. The dataset is available at http://classic.fec.gov/disclosurep/pnational.do and contains the financial information disclosed by presidential candidates for their campaigns, including the individual contributions they received.
The 2016 presidential campaign unexpectedly brought Trump into the White House, and Hillary Clinton missed the chance to become the first female president of the US. Many people assumed they would wake up to a female president the next morning, so the reversal took them by surprise. So what can the campaign finance data tell us? I will explore the nature of the campaign contributions and see whether there are any interesting relationships in the data, such as:
- Which party and which candidates got the most support?
- What is the difference between the supporters of different parties?
- How are the donations distributed spatially?
- Is there any model that we could build to make predictions?
Let us load the data and the required packages.
library(ggplot2)
library(dplyr)
library(psych)
library("gridExtra")
library(zipcode)
library(choroplethrZip)
library(choroplethr)
library(devtools)
library(choroplethrMaps)
library(gender)
library(lubridate) # for the year() function
library(cowplot)   # for controlling plot size and layout
library(polycor)   # for hetcor()
getwd()
## [1] "/Users/apple/Desktop/project/Udacity/data_analyst/data_exploration_analysis/project/Financial-Data-Exploration-on-2016-President-Campaign-in-NY"
# Load the Data
campaign<-read.csv("P00000001-NY.csv",row.names = NULL)
names(campaign)
# Subset the dataset: clearly state the variables used
#tmpdata <- subset(campaign, select = -c(varname1, varname2))
#tmpdata <- campaign[ , !names(campaign) %in% c("varname1", "varname2")]
dim(campaign)
str(campaign)
head(campaign)
#View(campaign)
Description of Data Set: This dataset has 649460 records with 18 variables. Most variables are categorical. The dataset includes info about the candidate each contributor supported, the contributor's name, location, and occupation, and the amount contributed in New York.
str(campaign)
## 'data.frame': 649460 obs. of 18 variables:
## $ cmte_id : Factor w/ 25 levels "C00458844","C00500587",..: 6 6 7 7 6 7 7 6 6 15 ...
## $ cand_id : Factor w/ 25 levels "P00003392","P20002671",..: 1 1 12 12 1 12 12 1 1 23 ...
## $ cand_nm : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 20 20 4 20 20 4 4 23 ...
## $ contbr_nm : Factor w/ 119407 levels " BLACKMORE, ANDI POTAMKIN",..: 52356 19422 54646 61896 8862 63027 63080 54730 55289 91553 ...
## $ contbr_city : Factor w/ 2327 levels ""," BROOKLYN",..: 1424 287 1424 269 1620 1406 1644 1424 595 1424 ...
## $ contbr_st : Factor w/ 1 level "NY": 1 1 1 1 1 1 1 1 1 1 ...
## $ contbr_zip : Factor w/ 69028 levels "","`1136","0",..: 7135 64432 5890 40174 58413 30757 49101 5353 62817 69028 ...
## $ contbr_employer : Factor w/ 39302 levels ""," ENGENDERHEALTH",..: 23091 30023 24570 24031 16445 24882 33393 30988 23091 16445 ...
## $ contbr_occupation: Factor w/ 17219 levels ""," ADMINISTRATIVE ASSISTANT",..: 13288 1249 10332 16184 7896 15445 12064 16984 13288 7896 ...
## $ contb_receipt_amt: num 100 67 50 15 100 ...
## $ contb_receipt_dt : Factor w/ 697 levels "1-Apr-15","1-Apr-16",..: 137 367 622 577 71 599 599 229 656 22 ...
## $ receipt_desc : Factor w/ 27 levels ""," SEE REATTRIBUTION",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ memo_cd : Factor w/ 2 levels "","X": 2 2 1 1 2 1 1 2 2 2 ...
## $ memo_text : Factor w/ 262 levels ""," SEE REATTRIBUTION",..: 33 33 5 5 33 5 5 33 33 1 ...
## $ form_tp : Factor w/ 3 levels "SA17A","SA18",..: 2 2 1 1 2 1 1 2 2 2 ...
## $ file_num : int 1091718 1091718 1077404 1077404 1091718 1077404 1077404 1091718 1091718 1146165 ...
## $ tran_id : Factor w/ 648415 levels "A00136D649BB94F13985",..: 275346 275731 533152 531094 275039 532242 531783 275424 274830 450800 ...
## $ election_tp : Factor w/ 6 levels "","G2016","O2016",..: 5 5 5 5 5 5 5 5 5 2 ...
summary(campaign)
## cmte_id cand_id cand_nm
## C00575795:399522 P00003392:399522 Clinton, Hillary Rodham :399522
## C00577130:174564 P60007168:174564 Sanders, Bernard :174564
## C00580100: 36931 P80001571: 36931 Trump, Donald J. : 36931
## C00574624: 16785 P60006111: 16785 Cruz, Rafael Edward 'Ted': 16785
## C00573519: 6638 P60005915: 6638 Carson, Benjamin S. : 6638
## C00458844: 4813 P60006723: 4813 Rubio, Marco : 4813
## (Other) : 10207 (Other) : 10207 (Other) : 10207
## contbr_nm contbr_city contbr_st
## BODNICK, KATIE : 1326 NEW YORK :206993 NY:649460
## BRUN, GINA : 413 BROOKLYN : 86953
## BRONER, NAHAMA : 318 BRONX : 14102
## SCHWARTZ, HILARY : 311 ROCHESTER : 9985
## KILLORIN, MICHAEL: 310 STATEN ISLAND: 7431
## GRODY, GORDON : 307 BUFFALO : 6196
## (Other) :646475 (Other) :317800
## contbr_zip contbr_employer contbr_occupation
## 100015704: 1332 N/A : 82294 RETIRED : 98667
## 10024 : 1197 SELF-EMPLOYED: 67025 NOT EMPLOYED : 47994
## 10022 : 871 RETIRED : 42105 ATTORNEY : 26486
## 10023 : 864 NONE : 32333 INFORMATION REQUESTED: 16956
## 10025 : 829 NOT EMPLOYED : 21881 TEACHER : 15080
## 10128 : 745 (Other) :403497 (Other) :444199
## (Other) :643622 NA's : 325 NA's : 78
## contb_receipt_amt contb_receipt_dt receipt_desc
## Min. : -10100 6-Nov-16 : 6125 :641311
## 1st Qu.: 15 31-Oct-16: 5924 Refund : 5994
## Median : 27 26-Sep-16: 5922 REDESIGNATION TO GENERAL : 406
## Mean : 264 2-Nov-16 : 5908 REDESIGNATION FROM PRIMARY: 404
## 3rd Qu.: 100 4-Nov-16 : 5814 REATTRIBUTION FROM SPOUSE : 221
## Max. :12777706 31-Mar-16: 5768 REATTRIBUTION TO SPOUSE : 221
## (Other) :613999 (Other) : 903
## memo_cd memo_text form_tp
## :541113 :398114 SA17A:537819
## X:108347 * EARMARKED CONTRIBUTION: SEE BELOW:170844 SA18 :105647
## * HILLARY VICTORY FUND : 75710 SB28A: 5994
## *BEST EFFORTS UPDATE : 773
## * HILLARY ACTION FUND : 572
## REDESIGNATION TO GENERAL : 406
## (Other) : 3041
## file_num tran_id election_tp
## Min. :1003942 SA17A.4846: 3 : 690
## 1st Qu.:1079445 C10000499 : 2 G2016:271367
## Median :1104813 C10000663 : 2 O2016: 237
## Mean :1105477 C10091902 : 2 P2015: 1
## 3rd Qu.:1133832 C1013282 : 2 P2016:377162
## Max. :1146285 C1014914 : 2 P2020: 3
## (Other) :649447
To better understand the presidential campaign data, I will add two more datasets, zipcode and demographics, to our dataset.
# geographic data
#data(zip.map)
#str(zip.map)
data(zipcode)
str(zipcode)
## 'data.frame': 44336 obs. of 5 variables:
## $ zip : chr "00210" "00211" "00212" "00213" ...
## $ city : chr "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
## $ state : chr "NH" "NH" "NH" "NH" ...
## $ latitude : num 43 43 43 43 43 ...
## $ longitude: num -71 -71 -71 -71 -71 ...
# demographic data
data(df_zip_demographics)
str(df_zip_demographics)
## 'data.frame': 33120 obs. of 9 variables:
## $ region : chr "00601" "00602" "00603" "00606" ...
## $ total_population : num 18450 41302 53683 6591 28963 ...
## $ percent_white : num 1 4 2 0 1 0 0 1 2 0 ...
## $ percent_black : num 0 0 0 0 0 0 0 0 0 0 ...
## $ percent_asian : num 0 0 0 0 0 0 0 0 0 0 ...
## $ percent_hispanic : num 99 94 96 100 99 100 100 99 98 100 ...
## $ per_capita_income: num 7380 8463 9176 6383 7892 ...
## $ median_rent : num 285 319 252 230 334 315 285 338 400 319 ...
## $ median_age : num 36.6 38.6 38.9 37.3 39.2 38.5 40.9 36.2 42 39.7 ...
Before any preprocessing, I will first select the features that we are interested in or that might be helpful for the processing or analysis.
Feature selection
# Clearly state the names of column
#names(campaign)
# "vcand_id" in the original call matched no column, so cand_id was never dropped
campaign <- campaign[ , !names(campaign) %in% c("cmte_id", "cand_id", "receipt_desc", "memo_cd", "memo_text", "form_tp", "file_num", "tran_id", "election_tp")]
#campaign<-campaign[,1:11]
# log() of the negative (refund) amounts yields NaN, which hist() drops
hist(log(campaign$contb_receipt_amt), main = "Histogram of 2016 Presidential Contributions (log scale)")
Dataset issues: This dataset includes 649460 observations of 18 variables, 15 of which are categorical. There are missing values in contbr_zip, contbr_employer, contbr_occupation, and election_tp. There are also inconsistencies in contbr_employer and in the contributor location fields that need preprocessing. As there is no party info for each candidate, I will add a new variable for that. In addition, contb_receipt_amt has some negative values (the summary above shows a minimum of -10100), and I will remove those.
So, in general, we will:
Remove negative and missing values from the contribution amount field.
Change the type of certain fields, especially the date.
Clean up the zipcode field to 5 digits.
Add party info for each contribution.
Split each name into first name and last name, in order to infer gender from the first name.
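The first two steps in the list above could look roughly like this. It is a minimal sketch on a made-up toy data frame, not the code used for the report. The election date of 2016-11-08 as the reference point for date_upto_elec is an assumption, and the month names are matched against R's built-in month.abb so the parsing does not depend on the locale.

```r
# Toy rows in the same layout as the FEC file (values are made up).
toy <- data.frame(
  contb_receipt_amt = c(100, -50, 27, NA, 15),
  contb_receipt_dt  = c("6-Nov-16", "31-Oct-16", "1-Apr-15", "2-Nov-16", "26-Sep-16"),
  stringsAsFactors  = FALSE
)

# Keep only strictly positive, non-missing amounts (refunds are negative).
clean <- toy[!is.na(toy$contb_receipt_amt) & toy$contb_receipt_amt > 0, ]

# Parse dates such as "6-Nov-16" into Date objects.
# In an English locale, as.Date(x, format = "%d-%b-%y") is a shorter equivalent.
parts <- do.call(rbind, strsplit(clean$contb_receipt_dt, "-", fixed = TRUE))
clean$date <- as.Date(sprintf("20%s-%02d-%02d",
                              parts[, 3],
                              match(parts[, 2], month.abb),
                              as.integer(parts[, 1])))

# Days remaining until election day (assumed reference date).
election_day <- as.Date("2016-11-08")
clean$date_upto_elec <- as.numeric(election_day - clean$date)

print(clean[, c("contb_receipt_amt", "date", "date_upto_elec")])
```

Filtering before parsing keeps the refund rows out of every later summary, which matters here because the raw minimum is a large negative value.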
Now, I want to clean the ZIP codes down to 5 digits, so that we can join them to the zipcode and demographic datasets.
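A sketch of how the ZIP cleanup and the join could be done in base R is shown here. The helper clean_zip() and the small demo table are hypothetical stand-ins (the income figures are made up); in the report the lookup tables would be zipcode and df_zip_demographics.

```r
# Raw ZIP values of the kinds seen in contbr_zip: ZIP+4, 5-digit, and malformed.
raw_zip <- c("100015704", "10024", "`1136", "0", "")

# Hypothetical helper: reduce a raw ZIP field to a 5-digit string or NA.
clean_zip <- function(z) {
  digits <- gsub("[^0-9]", "", as.character(z))  # drop stray characters
  zip5   <- substr(digits, 1, 5)                 # ZIP+4 -> first five digits
  zip5[nchar(zip5) != 5] <- NA                   # too short to be a ZIP code
  zip5
}

contrib <- data.frame(
  zip    = clean_zip(raw_zip),
  amount = c(50, 100, 25, 10, 5),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the demographic lookup table (keyed by "region").
demo <- data.frame(
  region            = c("10001", "10024"),
  per_capita_income = c(60000, 90000),
  stringsAsFactors  = FALSE
)

# Left join: keep every contribution, attach demographics where the ZIP matches.
joined <- merge(contrib, demo, by.x = "zip", by.y = "region", all.x = TRUE)
print(joined)
```

Keeping the ZIP as a character string preserves leading zeros, which matches how both lookup tables store their keys.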
Finally, I will add gender info to the dataset based on the contributor's first name. Not every contributor gets a gender prediction, as the first name may be abbreviated or uncommon.
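The name split that feeds the gender lookup might be sketched as follows. The helper split_name() is hypothetical and assumes names are stored as "LAST, FIRST MIDDLE", as in the contbr_nm values shown earlier. Names reduced to an initial or lacking a comma illustrate why some contributors get no gender prediction.

```r
# Hypothetical helper: split "LAST, FIRST MIDDLE" into last and first name.
split_name <- function(nm) {
  nm    <- trimws(as.character(nm))
  last  <- trimws(sub(",.*$", "", nm))       # text before the first comma
  rest  <- trimws(sub("^[^,]*,?", "", nm))   # text after the first comma
  first <- sub("[ .].*$", "", rest)          # first token of the remainder
  first[first == ""] <- NA                   # no first name available
  data.frame(contbr_LastName = last, contbr_FirstName = first,
             stringsAsFactors = FALSE)
}

name_parts <- split_name(c("BODNICK, KATIE",
                           " BLACKMORE, ANDI POTAMKIN",
                           "SMITH, J. ROBERT",
                           "DOE"))
print(name_parts)
```

The resulting first names can then be passed to a name-based gender lookup such as the gender package loaded above.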
Summary: After processing the data, the dataset has some additional variables. They are explained in detail below:
date_upto_elec: number of days before the final result of the presidential election.
party: the candidate's political party affiliation.
contbr_FirstName: the contributor's first name, split from the contributor name.
contbr_LastName: the contributor's last name, split from the contributor name.
cand_FirstName: the candidate's first name, split from the candidate name.
cand_LastName: the candidate's last name, split from the candidate name.
longitude: the longitude of the contributor's location.
latitude: the latitude of the contributor's location.
total_population: the population of the corresponding zipcode region.
percent_white: the percentage of white residents in the corresponding zipcode region.
percent_black: the percentage of black residents in the corresponding zipcode region.
percent_asian: the percentage of Asian residents in the corresponding zipcode region.
percent_hispanic: the percentage of Hispanic residents in the corresponding zipcode region.
per_capita_income: the income per capita of the corresponding zipcode region.
median_rent: the median rent of the corresponding zipcode region.
median_age: the median age of the corresponding zipcode region.
gender: the gender of the contributor.
With the dataset preprocessed, we can start the univariate analysis. Overall, Hillary Clinton was the candidate who received the most contributions, and Democrats received more contributions than any other party in New York. The city with the most donations is New York, and retired people are an important donor group that made up a large share of contributions in 2016. Individual contributions are typically small: in the raw summary above, the median is 27 dollars and the mean is 264 dollars.
For the most important field, the contribution amount, we can see that most contributions are below 1000 dollars and that the distribution is right-skewed.
Now, let us check the dates of those contributions.
For the dates of those donations, we can see that there were more donations in 2016 than in 2015, and that donation activity appears to dip during the holiday season.
In the 2016 presidential campaign, 5 candidates are Democrats, 3 are third-party candidates, and all the others are Republicans.
As we can see, most contributions went to Democrats, as New York is a state with a large base of Democratic supporters.
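Since the file carries no party column, the party variable has to be derived from the candidate name. A minimal sketch of one way to do that is below. The two name lists are my assumption about which candidates fall into each group (they are not taken from the dataset), and "third party" is a placeholder label.

```r
# Assumed party assignment by candidate last name (not part of the FEC file).
democrats   <- c("Clinton", "Sanders", "O'Malley", "Webb", "Lessig")
third_party <- c("Johnson", "Stein", "McMullin")

# Hypothetical helper: map a "Last, First" candidate name to a party label.
assign_party <- function(cand_nm) {
  last <- trimws(sub(",.*$", "", as.character(cand_nm)))
  ifelse(last %in% democrats, "democrat",
         ifelse(last %in% third_party, "third party", "republican"))
}

party_example <- assign_party(c("Clinton, Hillary Rodham",
                                "Trump, Donald J.",
                                "Stein, Jill",
                                "Sanders, Bernard"))
print(party_example)
```

Defaulting every unlisted name to "republican" mirrors the statement that all remaining candidates are Republicans, but it would silently mislabel any candidate missing from the two lists, so the lists are worth checking against levels(campaign$cand_nm).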
Then, for the large part of the dataset related to contributor info, I will go through the name, location, gender, and occupation info to find out what characterizes each party's supporters.
First, the name info. There are 119407 distinct contributor names among 649460 records, about 5.4 contributions per name on average, which suggests that most contributors made only a handful of contributions. So the names themselves will not be of interest in our analysis.
# First, Name info
length(table(campaign$contbr_nm))
mean(table(campaign$contbr_nm))
Then, the location info. As there are a lot of discrepancies in these fields, cleaning them up would take time. To keep the preprocessing manageable, I will use the ZIP code, truncated to 5 digits, to represent the location.
length(table(campaign$contbr_city))#2327
table(campaign$contbr_city)
length(table(campaign$contbr_st)) #1
length(table(campaign$contbr_zip))#69031
#library(ggthemes)
#scale_colour_tableau()
As we can see, most donors were from New York City.
The demographic features behind each zipcode may be closely related to the contributors in that region. We can see that most donors are from regions with a higher share of white residents, a median age between 25 and 50, and a per-capita income between 20000 and 40000.
For the gender info, we can see that there are more female contributors than male contributors in New York State, but the difference is not that large. What is the reason behind that? Is the difference significant?
For the occupation info, I will show the top 25 occupations by number of contributions in the 2016 campaign. From the plot, we can observe several big categories: retired, not employed, self-employed, and employed. In the employed group, lawyers, professors, CEOs, and doctors were more actively involved.
In general, employed, retired, and unemployed people, in that order, were the most active supporters of the presidential campaign. Among the employed, lawyers, professors, and CEOs are more active, with lawyers and CEOs donating the most.
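Picking the most frequent occupations can be done with a frequency table. The sketch below uses a made-up vector and a top 3 instead of the top 25, and it treats empty strings as missing, which is an assumption about how blanks should be handled.

```r
# Made-up occupation values; "" stands for a blank entry.
occupation <- c("RETIRED", "ATTORNEY", "RETIRED", "TEACHER",
                "NOT EMPLOYED", "RETIRED", "ATTORNEY", "")

# Drop blanks, count each occupation, and keep the most frequent ones.
occupation <- occupation[occupation != ""]
n_top      <- 3
top_occ    <- head(sort(table(occupation), decreasing = TRUE), n_top)
print(top_occ)

# Subset of the data restricted to those occupations, ready for plotting.
in_top <- occupation %in% names(top_occ)
```

With the real data, the same pattern applied to campaign$contbr_occupation with n_top set to 25 would give the categories to plot.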
The employer info has not been cleaned up, and a single employer can appear under several names. Cleaning all of it would take considerable work, so I will focus only on the top 25 employers and merge their duplicates.
As with the occupation info, there are four big categories: unemployed, self-employed, retired, and employed, where the employed part is made up of all the companies in our dataset. So in this section, I will plot the four big categories and then the top companies in the employed category by number of donations. As we can see, people at several universities were proactively involved in contributing to the presidential campaign, followed by private foundations, government agencies, and some big technology companies.
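The grouping into four big categories could be sketched like this. The helper categorize() is hypothetical, and the label sets it matches are assumptions based on the values visible in the summary output (such as "RETIRED", "NOT EMPLOYED", "SELF-EMPLOYED", "N/A"). A full cleanup would need a longer list of variants.

```r
# Hypothetical helper: collapse raw employer/occupation text into four groups.
categorize <- function(x) {
  missing <- is.na(x)
  x   <- toupper(trimws(as.character(x)))
  out <- rep("employed", length(x))   # default: a named employer or job
  out[x %in% c("RETIRED")] <- "retired"
  out[x %in% c("NOT EMPLOYED", "UNEMPLOYED", "NONE")] <- "not employed"
  out[x %in% c("SELF-EMPLOYED", "SELF EMPLOYED", "SELF")] <- "self-employed"
  # Values that carry no information are treated as missing.
  out[missing | x %in% c("", "N/A", "INFORMATION REQUESTED")] <- NA
  out
}

groups <- categorize(c("Retired", "ATTORNEY", "not employed",
                       "SELF-EMPLOYED", "N/A", NA))
print(table(groups, useNA = "ifany"))
```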
Question Answering
This dataset includes 649460 observations of 18 variables, most of which (15) are categorical.
From what I explored above, the main questions I am interested in are:
- What is the difference between the supporters of each party?
- Which party will a contributor donate to, based on their background info?
- What could impact the contribution amount?
- When is the best time to get donations, and where does the largest financial support come from?
Candidate name info: I can derive each candidate's party from their name, which helps us understand the general picture of party support in New York.
Contributor career info: Although it is very varied, it still carries important information, such as which industry contributors work in and how much they might earn, and this may help explain why they donate to a certain party.
Contributor location info: It provides the spatial distribution of the contributors, which might differ between parties.
Contributor name info: Although most people did not donate continuously during the campaign period, we can infer gender from the first name, which is useful for looking into the gender distribution of campaign contributions in NY.
Zipcode demographics info: It helps us understand our contributors as well as the donations they made.
This dataset has some inconsistencies in contbr_employer, contbr_occupation, and the contributor location fields that need preprocessing. As the location info is too messy, I dropped those variables and used the zipcode library to clean up and represent the location. As the occupation and employer info are too varied, I decided to visualize the top 20 of each and to group them into several big categories: unemployed, retired, self-employed, and employed.
First, check out the correlation between numerical variables.
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## contb_receipt_amt date_upto_elec total_population
## contb_receipt_amt 1 Pearson Pearson
## date_upto_elec 0.1567 1 Pearson
## total_population 0.006446 -0.03311 1
## latitude -0.08131 0.06907 -0.4263
## longitude 0.04613 -0.04045 0.1552
## percent_white 0.04733 0.02832 -0.4823
## percent_black -0.0539 -0.01749 0.2398
## percent_asian 0.04005 -0.0257 0.2655
## percent_hispanic -0.05032 -0.01312 0.4094
## per_capita_income 0.1868 -0.05334 0.00238
## median_rent 0.153 -0.06877 0.158
## median_age 0.01807 0.002415 -0.4075
## party 0.1051 0.1376 -0.2007
## gender 0.04275 0.09999 0.003298
## latitude longitude percent_white percent_black
## contb_receipt_amt Pearson Pearson Pearson Pearson
## date_upto_elec Pearson Pearson Pearson Pearson
## total_population Pearson Pearson Pearson Pearson
## latitude 1 Pearson Pearson Pearson
## longitude -0.6452 1 Pearson Pearson
## percent_white 0.4226 -0.2156 1 Pearson
## percent_black -0.1285 0.02713 -0.7007 1
## percent_asian -0.3599 0.1683 -0.3351 -0.1387
## percent_hispanic -0.3684 0.2399 -0.7571 0.2246
## per_capita_income -0.3753 0.2334 0.3374 -0.3364
## median_rent -0.671 0.483 0.09245 -0.2376
## median_age 0.1715 0.02231 0.5211 -0.3594
## party 0.1442 -0.09026 0.2111 -0.1272
## gender -0.0003517 -0.01185 -0.05763 0.03318
## percent_asian percent_hispanic per_capita_income
## contb_receipt_amt Pearson Pearson Pearson
## date_upto_elec Pearson Pearson Pearson
## total_population Pearson Pearson Pearson
## latitude Pearson Pearson Pearson
## longitude Pearson Pearson Pearson
## percent_white Pearson Pearson Pearson
## percent_black Pearson Pearson Pearson
## percent_asian 1 Pearson Pearson
## percent_hispanic 0.1015 1 Pearson
## per_capita_income 0.1446 -0.3251 1
## median_rent 0.272 -0.08536 0.8205
## median_age -0.1945 -0.3729 0.1778
## party -0.1123 -0.1412 -0.1372
## gender 0.02402 0.04885 -0.05253
## median_rent median_age party gender
## contb_receipt_amt Pearson Pearson Polyserial Polyserial
## date_upto_elec Pearson Pearson Polyserial Polyserial
## total_population Pearson Pearson Polyserial Polyserial
## latitude Pearson Pearson Polyserial Polyserial
## longitude Pearson Pearson Polyserial Polyserial
## percent_white Pearson Pearson Polyserial Polyserial
## percent_black Pearson Pearson Polyserial Polyserial
## percent_asian Pearson Pearson Polyserial Polyserial
## percent_hispanic Pearson Pearson Polyserial Polyserial
## per_capita_income Pearson Pearson Polyserial Polyserial
## median_rent 1 Pearson Polyserial Polyserial
## median_age 0.004418 1 Polyserial Polyserial
## party -0.1394 0.2104 1 Polychoric
## gender -0.03283 -0.05711 0.2783 1
With the univariate analysis done, I will now check the distribution of contributions across parties, candidates, genders, and occupations.
Contribution by Parties
From the first part, we know that more people donated to the Democratic party. In the 2016 presidential campaign, Democrats clearly raised far more money in New York State than the other two groups. Most people donated around 100 dollars to both the Democratic and Republican parties, and people donated larger amounts to third parties, although fewer people were involved. For the mean contribution amount, the ranking was third party > republican > democrat, and the interquartile range (IQR) has the same ranking.
For gender, more female donors gave to the Democratic party, whereas in the Republican party male donors clearly dominate the contributions, and they tend to donate more than donors to the Democratic party. Both the mean contribution amount and the IQR are larger for males than for females, and the Wilcoxon rank sum test below shows that the difference is significant (p < 2.2e-16).
## $female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04 15.00 27.00 120.90 75.00 10800.00
##
## $male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1 15.0 30.0 148.8 100.0 10800.0
##
## Wilcoxon rank sum test with continuity correction
##
## data: contb_receipt_amt by gender
## W = 3.3681e+10, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
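For reference, output of this shape comes from a per-group summary followed by wilcox.test(). The sketch below reproduces the pattern on made-up data with a deliberate shift between the two groups. The column names follow the report, and exact = FALSE is set because the toy values contain ties.

```r
# Made-up amounts: the male group is shifted upward relative to the female group.
toy <- data.frame(
  gender            = factor(rep(c("female", "male"), each = 60)),
  contb_receipt_amt = c(1:60, 41:100)
)

# Five-number summary and mean per gender, as in the $female / $male output.
print(tapply(toy$contb_receipt_amt, toy$gender, summary))

# Wilcoxon rank sum test: a non-parametric comparison suited to skewed amounts.
wt <- wilcox.test(contb_receipt_amt ~ gender, data = toy, exact = FALSE)
print(wt$p.value)
```

A rank-based test is a reasonable choice here because contribution amounts are heavily right-skewed, so a comparison of means alone would be dominated by a few very large donations.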
Contribution by Candidates
From the general plots, the top three candidates by donations received are Clinton, Sanders, and Trump. The boxplot also shows that most of the donations they received were below 500 dollars.
By total amount, Clinton received the most in donations in the 2016 presidential campaign, followed by Sanders and Trump, as the table below shows.
## # A tibble: 25 x 6
## # Groups: cand_nm [25]
## cand_nm party sum mean median
## <fctr> <fctr> <dbl> <dbl> <dbl>
## 1 Clinton, Hillary Rodham democrat 145396196.3 373.4629 30.0
## 2 Sanders, Bernard democrat 8395372.6 48.7556 27.0
## 3 Trump, Donald J. republican 5853323.2 165.6617 65.7
## 4 Bush, Jeb republican 3709338.3 1623.3428 2700.0
## 5 Rubio, Marco republican 3009977.8 682.5346 100.0
## 6 Cruz, Rafael Edward 'Ted' republican 2055960.2 127.9935 45.0
## 7 Christie, Christopher J. republican 896712.0 1945.1453 2700.0
## 8 Kasich, John R. republican 888243.2 673.9327 250.0
## 9 Carson, Benjamin S. republican 709442.6 108.1303 50.0
## 10 Graham, Lindsey O. republican 515372.1 1758.9490 1500.0
## # ... with 15 more rows, and 1 more variables: n <int>
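The grouped tibble header suggests this table was built with dplyr's group_by() and summarise(). An equivalent summary in base R, shown on made-up data so that it runs without extra packages, might look like the following sketch.

```r
# Made-up contributions for two placeholder candidates.
toy <- data.frame(
  cand_nm           = c("A", "A", "A", "B", "B"),
  contb_receipt_amt = c(10, 20, 60, 100, 300),
  stringsAsFactors  = FALSE
)

amt  <- toy$contb_receipt_amt
cand <- toy$cand_nm

# One row per candidate: total, mean, median, and number of contributions.
total   <- tapply(amt, cand, sum)
by_cand <- data.frame(
  cand_nm = names(total),
  sum     = as.numeric(total),
  mean    = as.numeric(tapply(amt, cand, mean)),
  median  = as.numeric(tapply(amt, cand, median)),
  n       = as.integer(tapply(amt, cand, length)),
  stringsAsFactors = FALSE
)

# Rank candidates by total amount raised, largest first.
by_cand <- by_cand[order(by_cand$sum, decreasing = TRUE), ]
print(by_cand)
```

Reporting the median next to the mean is useful here: in the table above, Clinton's mean is 373 dollars while her median is 30 dollars, which shows how strongly a few large contributions pull the mean.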
Contribution by Candidate of Each Party
If we dive into each candidate within a party, we can see that more women supported Clinton, which made women the majority among Democratic contributors.
Contribution by Occupation
Overall, Republicans got more donations from the retired group than Democrats did, whereas Democrats got more from the employed and unemployed groups.
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## contb_receipt_amt employer
## contb_receipt_amt 1 Polyserial
## employer 0.002845 1
For the occupations, homemakers, CEOs, real estate professionals, and lawyers had the larger mean contribution amounts as well as the larger IQRs. Among Democratic donations, the main groups were retired people, lawyers, CEOs, homemakers, and consultants, and the Republican party has similar main groups.
With respect to the top three candidates, we can see that lawyers contributed the most to support Clinton, followed by retired people. For Sanders, unemployed people were the main donor group, and retired people were the strongest supporters of Trump.
Contribution by Employer
For the employer, people employed at several universities are the main donor group for the Democratic party, whereas homemakers made more contributions to the Republican party. With respect to the top 3 candidates, people at New York University and Columbia University donated the most to Clinton and Sanders, and homemakers contributed the most to Trump.
Contribution by Date
First, I want to check the correlation between time and contribution amount. The test below shows that the linear correlation is essentially zero (r = -0.0007, p = 0.71), but there are two peaks in the trend, and the second one is very close to the final call of the election.
##
## Pearson's product-moment correlation
##
## data: as.numeric(campaign_trend$date) and campaign_trend$contb_receipt_amt
## t = -0.37751, df = 316760, p-value = 0.7058
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.004153131 0.002811662
## sample estimates:
## cor
## -0.000670743
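A test of this form can be produced with cor.test() after converting the date to a number. The following is a minimal sketch on made-up data. The column names follow the report, while the values are arbitrary.

```r
# Made-up daily amounts over ten consecutive days.
toy_trend <- data.frame(
  date              = as.Date("2016-01-01") + 0:9,
  contb_receipt_amt = c(50, 20, 75, 10, 100, 25, 60, 15, 80, 30)
)

# as.numeric() turns a Date into days since 1970-01-01, so the test measures
# whether amounts rise or fall linearly over time.
ct <- cor.test(as.numeric(toy_trend$date), toy_trend$contb_receipt_amt)
print(ct$estimate)
print(ct$p.value)
```

Pearson correlation only captures a linear trend, so a near-zero value does not rule out the kind of peaks seen in the plots.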
What does the distribution look like for each month of the year? In general, there were peaks after the holiday months, and there were more contributions in 2016 than in 2015.
##
## 2013 2014 2015 2016
## 1 43 58426 575082
For the contribution trend of each party, we can see that all of them had a second peak near the election, that the Republican and third-party trends fluctuated more, and that third-party contributions increased sharply as the election approached, which suggests to me that a share of donors did not want to support either Clinton or Trump.
Finally, zooming in on the candidates: Clinton's contributions followed an increasing trend, even after the election. For Trump, there was an obvious peak near the election, followed by a sudden drop. Sanders seems to have received more donations in early 2016 but fell behind Trump near the election.
Among all the demographic features of the zipcode regions, income per capita and median rent have a weak correlation with the contribution amount.
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## contb_receipt_dt percent_white percent_black
## contb_receipt_dt 1 Polyserial Polyserial
## percent_white -0.002247 1 Pearson
## percent_black -0.001827 -0.7007 1
## percent_asian 0.003847 -0.3351 -0.1387
## percent_hispanic 0.002925 -0.7571 0.2246
## per_capita_income 0.008978 0.3374 -0.3364
## median_rent 0.01014 0.09245 -0.2376
## median_age -0.003933 0.5211 -0.3594
## contbr_year <NA> <NA> <NA>
## percent_asian percent_hispanic per_capita_income
## contb_receipt_dt Polyserial Polyserial Polyserial
## percent_white Pearson Pearson Pearson
## percent_black Pearson Pearson Pearson
## percent_asian 1 Pearson Pearson
## percent_hispanic 0.1015 1 Pearson
## per_capita_income 0.1446 -0.3251 1
## median_rent 0.272 -0.08536 0.8205
## median_age -0.1945 -0.3729 0.1778
## contbr_year <NA> <NA> <NA>
## median_rent median_age contbr_year
## contb_receipt_dt Polyserial Polyserial Polyserial
## percent_white Pearson Pearson Pearson
## percent_black Pearson Pearson Pearson
## percent_asian Pearson Pearson Pearson
## percent_hispanic Pearson Pearson Pearson
## per_capita_income Pearson Pearson Pearson
## median_rent 1 Pearson Pearson
## median_age 0.004418 1 Pearson
## contbr_year <NA> <NA> 1
Finally, I want to look at the contribution amount by zipcode on the map. As we can see, most contributions come from New York City, but Republicans do receive a certain share from the rural areas of New York State.
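To draw such a map, the contributions first have to be totalled per ZIP code. The sketch below prepares that table on made-up data. The column names region and value follow the convention that the choroplethr family of packages expects, and the commented-out plotting call is my assumption about how the map could be drawn, not the report's actual code.

```r
# Made-up contributions with cleaned 5-digit ZIP codes.
toy_geo <- data.frame(
  zip               = c("10024", "10024", "14604", "10001"),
  party             = c("democrat", "democrat", "republican", "democrat"),
  contb_receipt_amt = c(100, 50, 250, 25),
  stringsAsFactors  = FALSE
)

# Total amount per ZIP code, renamed to the region/value layout.
map_df <- aggregate(contb_receipt_amt ~ zip, data = toy_geo, FUN = sum)
names(map_df) <- c("region", "value")
print(map_df)

# With choroplethrZip installed, a call along these lines would draw the map:
# zip_choropleth(map_df, state_zoom = "new york")
```

Running the same aggregation separately for each party would give one map per party, which is what makes the rural Republican share visible.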
There is no strong correlation between the contribution amount and any other feature. Only income per capita (0.19), the number of days before the election (0.16), median rent (0.15), and party (0.11) have a correlation with the contribution amount above 0.1.
For other features,
Gender: More female donors gave to the Democratic party, especially to Clinton, whereas in the Republican party male donors clearly dominate the contributions.
Occupation: Republicans got more donations from the retired group than Democrats did, whereas Democrats got more from the employed and unemployed groups. The main donor groups are retired people, lawyers, CEOs, homemakers, and consultants. With respect to the top three candidates, lawyers contributed the most to support Clinton, followed by retired people. For Sanders, unemployed people were the main donor group, and retired people were the strongest supporters of Trump.
Employer: People employed at several universities are the main donor group for the Democratic party, whereas homemakers made more contributions to the Republican party. With respect to the top 3 candidates, people at New York University and Columbia University donated the most to Clinton and Sanders, and homemakers contributed the most to Trump.
Dates: There were peaks after the holiday months, and there were more contributions in 2016 than in 2015.
The median rent is strongly correlated with income per capita (0.82), and the percentage of white residents is negatively correlated with the percentage of black residents (-0.70).
The strongest relationship is between median rent and income per capita. For the contribution amount, the most correlated feature is income per capita, followed by the number of days before the election and the median rent.
Now, I will pull out several correlated features, including income per capita, median rent, and party, and plot them against the contribution amount.
As we can see, people in higher-income regions contribute more to the campaign, males tend to donate more at the same income level, and third-party contributors tend to donate more at the same income level. This is reasonable and understandable. But for people with income below 2000, the donations varied.
People in regions with higher median rent tend to donate more, which is reasonable and matches the relationship with income per capita. For people whose rent is below 500 dollars, the donations varied a little, but not by much.
In general, the contribution amount dropped off as time passed, but the number of contributions increased.
The map is useful for identifying where donations came from, and the different parties could use it to identify their potential donors.
Besides income per capita, median rent, and party, the gender and date info also strengthen the relationship with the contribution amount.
As either rent or income per capita increases, the contribution amount increases more for Republicans than for Democrats, which means that Republican supporters were more generous than Democratic supporters at the same income level.
As the election approached, Republicans received more contributions than the other parties, especially near the final call of the election, even in this Democrat-dominated state. I guess this reflects a nationwide strategy to win more supporters, which may indirectly help explain why Trump won the White House.
Here I will try logistic regression to predict the party a donor contributes to from their gender, income level, rent level, donation amount, and the number of days before the election (a numeric transformation of the date).
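The model is fit on a data frame called train, and the output reports 345561 complete rows plus 4439 rows dropped for missingness, so train appears to hold 350000 rows. How the split was made is not shown. One common approach, sketched here on made-up data with an assumed 65% training share and an arbitrary seed, is a simple random split.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Made-up, imbalanced two-party data.
toy <- data.frame(
  id    = 1:100,
  party = factor(rep(c("democrat", "republican"), times = c(85, 15)))
)

# Draw the training rows at random; everything else becomes the test set.
train_idx <- sample(nrow(toy), size = floor(0.65 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(table(toy_train$party))
print(table(toy_test$party))
```

Restricting the data to two parties before fitting is needed for a binomial model; with a class imbalance like this one, checking the party share in both parts is a useful sanity test.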
model <- glm(party ~contb_receipt_amt+date_upto_elec+per_capita_income+median_rent+median_age+gender,family=binomial(link='logit'),data=train)
summary(model)
##
## Call:
## glm(formula = party ~ contb_receipt_amt + date_upto_elec + per_capita_income +
## median_rent + median_age + gender, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5474 -0.3424 -0.2610 -0.1946 3.6024
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.644e+00 8.558e-02 -89.324 <2e-16 ***
## contb_receipt_amt 3.771e-04 1.271e-05 29.668 <2e-16 ***
## date_upto_elec 2.585e-03 6.219e-05 41.559 <2e-16 ***
## per_capita_income -4.806e-06 5.344e-07 -8.993 <2e-16 ***
## median_rent -5.702e-04 4.545e-05 -12.545 <2e-16 ***
## median_age 1.205e-01 1.845e-03 65.319 <2e-16 ***
## gendermale 9.398e-01 1.722e-02 54.586 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 135473 on 345560 degrees of freedom
## Residual deviance: 124230 on 345554 degrees of freedom
## (4439 observations deleted due to missingness)
## AIC: 124244
##
## Number of Fisher Scoring iterations: 6
We can see that all the parameters we selected are statistically significant, which means that they are all associated with the party a donor contributes to. With all other variables held equal, the larger the contribution amount, the number of days before the election, or the median age, the more likely the donor is a Republican supporter. On the other hand, donors from regions with higher income or higher rent, and female donors, are more likely to be Democratic donors.
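Because the raw coefficients are on the log-odds scale, they are easier to read as odds ratios. The sketch below exponentiates the estimates copied from the summary above. It assumes that "republican" is the outcome level being modelled, which is consistent with the interpretation in the text.

```r
# Coefficient estimates copied from the glm summary above.
coefs <- c(
  contb_receipt_amt = 3.771e-04,
  date_upto_elec    = 2.585e-03,
  per_capita_income = -4.806e-06,
  median_rent       = -5.702e-04,
  median_age        = 1.205e-01,
  gendermale        = 9.398e-01
)

# Odds ratio per one-unit increase in each predictor.
odds_ratio <- exp(coefs)
print(round(odds_ratio, 4))

# Per-dollar effects are tiny, so rescale to more meaningful steps.
or_per_100_dollars   <- exp(100 * coefs["contb_receipt_amt"])
or_per_10k_income    <- exp(10000 * coefs["per_capita_income"])
print(c(or_per_100_dollars, or_per_10k_income))
```

Read this way, a male donor has roughly 2.56 times the odds of giving to a Republican compared with a female donor with the same other characteristics, and each additional year of regional median age multiplies the odds by about 1.13.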
##
## model_pred_direction democrat republican
## democrat 158863 25134
## republican 605 141
## Accuracy: 0.8606767
The accuracy on the test set is 0.86, which looks good at first glance. However, accuracy alone does not reflect the model's real performance because the dataset is heavily imbalanced: reading the rows of the table as predictions, the model assigns almost every donation to the democrat class and correctly identifies only 141 republican donations. The model summary also shows that the coefficients on the numeric variables are very small per unit, so precise predictions are difficult if the donor information is not accurate.
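To see why accuracy is misleading here, the confusion matrix above can be re-entered by hand and a few additional metrics computed. This is a minimal sketch that assumes the rows of the table are predictions and the columns are the actual parties; the object names are hypothetical.

```r
# Confusion matrix copied from the output above
# (assumed layout: rows = predicted party, columns = actual party)
cm <- matrix(
  c(158863, 605,    # actual democrat:   predicted democrat, predicted republican
    25134,  141),   # actual republican: predicted democrat, predicted republican
  nrow = 2,
  dimnames = list(predicted = c("democrat", "republican"),
                  actual    = c("democrat", "republican"))
)

# Share of all test donations classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of a trivial model that always predicts the majority class
baseline <- max(colSums(cm)) / sum(cm)

# Recall: share of actual republican donations the model finds
recall_rep <- cm["republican", "republican"] / sum(cm[, "republican"])

# Precision: share of predicted republican donations that are correct
precision_rep <- cm["republican", "republican"] / sum(cm["republican", ])

print(c(accuracy = accuracy, baseline = baseline,
        recall_rep = recall_rep, precision_rep = precision_rep))
```

Under that reading of the table, the model's accuracy is slightly below the always-democrat baseline and its recall for republican donations is under 1%, which is why a metric other than accuracy is needed for this dataset.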
## # A tibble: 4,665 x 7
## # Groups: cand_nm, date [4,665]
## cand_nm date party median_date average_date sum_date
## <fctr> <date> <fctr> <dbl> <dbl> <dbl>
## 1 Bush, Jeb 2015-06-15 republican 2700 2280.952 47900
## 2 Bush, Jeb 2015-06-16 republican 750 1233.333 7400
## 3 Bush, Jeb 2015-06-17 republican 2700 2665.000 53300
## 4 Bush, Jeb 2015-06-18 republican 2700 2705.000 54100
## 5 Bush, Jeb 2015-06-19 republican 2700 2562.162 94800
## 6 Bush, Jeb 2015-06-20 republican 2700 2700.000 2700
## 7 Bush, Jeb 2015-06-21 republican 1850 1850.000 3700
## 8 Bush, Jeb 2015-06-22 republican 2700 2607.353 88650
## 9 Bush, Jeb 2015-06-23 republican 2700 2558.333 76750
## 10 Bush, Jeb 2015-06-24 republican 2700 2675.000 117700
## # ... with 4,655 more rows, and 1 more variables: count_date <int>
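The code that produced the table above is hidden in the report. A summary with the same column names could be built with dplyr roughly as follows. This is a sketch, not the author's code: the toy data frame `donations` and its candidate names are made up, and the grouping is inferred from the printed header.

```r
library(dplyr)

# Made-up toy rows, only to make the sketch runnable
donations <- data.frame(
  cand_nm = c("Candidate A", "Candidate A", "Candidate A", "Candidate B"),
  party   = c("republican", "republican", "republican", "democrat"),
  date    = as.Date(c("2015-06-15", "2015-06-15", "2015-06-16", "2015-06-15")),
  contb_receipt_amt = c(2700, 100, 750, 50)
)

# One row per candidate, day and party with the daily statistics.
# summarise() drops the last grouping variable, so the result stays
# grouped by cand_nm and date, as in the printed tibble header.
daily <- donations %>%
  group_by(cand_nm, date, party) %>%
  summarise(
    median_date  = median(contb_receipt_amt),
    average_date = mean(contb_receipt_amt),
    sum_date     = sum(contb_receipt_amt),
    count_date   = n()
  )

print(daily)
```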
As we can see, democrats dominate the donations in New York State. Overall, the campaigns received the most donations in 2016, with seasonal peaks, especially after the summer and winter holiday seasons. Every party shows a second peak in its donation trend, but the timing differs: the third parties' peak came much earlier and the republicans' much later. Among the top three candidates, Clinton won by a wide margin in New York State. Sanders and Trump raised comparable amounts, but Trump seems to have received more donations as the election approached.
Donors' income and rent levels, inferred from the zip codes they provided, are positively correlated with the contribution amount. Because most donations came from New York City, it is hard to identify spatial differences. Exploring the regions' demographic information, however, we notice that republican supporters were more generous than democrat supporters, although democrat supporters dominate the New York region.
As we can see, the main contributors in the New York State campaign were CEOs, homemakers, lawyers, real estate professionals, doctors, and professors. The most actively involved employers were universities, government agencies, and several technology companies.
Although the parties draw on different supporters, each has a distinct career profile. Democrat supporters were mostly employed by universities, and their most common occupations were retired, CEO, lawyer, homemaker, and consultant. Republican supporters more often listed homemaker or self-employed as their employer, and their most common occupations were homemaker and retired.
In this project, I set out to identify what drives the donation amounts and how the supporters of different parties differ. The features most correlated with the contribution amount were party, gender, median rent, per-capita income, and the date of the donation. As for the differences between supporters, a male donor who gives more than average and lives in a region with lower income and lower rent is more likely to be a republican supporter.
The main challenges and difficulties I met were:
Data Preprocessing - The original dataset is somewhat messy, especially the occupation, employer, and geolocation fields. For the geolocation, I used the zipcode library to clean the zip codes and its zipcode dataset to look up the longitude and latitude for each one. For the occupation field, I did not clean and categorize every value; I focused on the top 25 occupations for cleaning, analysis, and plotting.
Data Visualization - It is hard to lay out a clear analytical and logical process beforehand, so it takes time to try, change, discover, and organize.
Visualizing the geographic information on a map took me some time to work out, and I am still not entirely clear about this part.
Producing good-looking graphs also took time: I had to experiment with the output size, the text size, and the legend format.
Data Modeling - Plotting is not always the most direct way to find relationships between variables, so I spent time both extracting information from the plots and identifying relationships with statistical methods.
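As an illustration of the zip code cleanup described under Data Preprocessing, here is a base R sketch that reduces raw zip strings to five digits. It stands in for the cleaning done with the zipcode library and is not that package's API; the helper name `clean_zip` and the sample values are hypothetical.

```r
# Hypothetical helper: keep the first five digits of a raw zip string,
# returning NA when fewer than five digits are available
clean_zip <- function(z) {
  z <- gsub("[^0-9]", "", as.character(z))  # drop hyphens, spaces, etc.
  z <- substr(z, 1, 5)                      # ZIP+4 values become 5-digit zips
  ifelse(nchar(z) == 5, z, NA_character_)
}

# Made-up sample values covering the common cases
raw_zip <- c("100251234", "10025-1234", "1002", NA)
print(clean_zip(raw_zip))
```

The cleaned five-digit codes can then be joined to a lookup table of zip codes with longitude and latitude, for example with `merge()`.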
For the success,
Conclusion
By analyzing the campaign finance dataset, I found several interesting results:
- New York State was dominated by democrat supporters, who were typically retired people, university professors, lawyers, homemakers, and doctors.
- Clinton received the most donations, followed by Sanders and Trump.
- Female donors slightly outnumbered male donors among democrat supporters, while male donors clearly dominated the republican donations.
- In general, republican supporters were more generous, donating more than democrat supporters at the same income level.
- Party, income level, rent level, and gender all affect how much people donate. Timing also matters, as more people donate in the period after the holiday seasons.
- Democrat and republican supporters have different characteristics, especially in their careers. Homemakers account for a large share of Trump's donations, and university employees are key to Clinton's.
Shortcomings
Future